Initial Loading
?library
## starting httpd help server ... done
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ggdist Loading
library(ggdist)
To figure out how to load a library in RStudio, I used a quick Google search. I entered “how to load libraries in R” into the search bar and opened the first two links, both of which explained how to use the “library()” function. Running an additional code block with the help function confirmed the contents of those links. After adding it to the code block and running it, I was able to see that the library was successfully loaded into RStudio. I restarted R, cleared the outputs, and ran it once more to be sure.
https://bookdown.org/nana/intror/install-and-load-packages.html
https://campus.datacamp.com/courses/intermediate-r/chapter-3-functions?ex=17
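A related sketch of a more defensive loading pattern (not something the links above required): base R’s requireNamespace() checks whether a package is installed without attaching it, so the install step only runs when the package is actually missing.

```r
# Defensive loading sketch: install ggdist only if it is missing, then attach it.
# requireNamespace() returns FALSE (instead of erroring) when the package
# is not installed.
if (!requireNamespace("ggdist", quietly = TRUE)) {
  install.packages("ggdist")
}
library(ggdist)
```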
?tidyverse
Description
The ‘tidyverse’ is a set of packages that work in harmony because they share common data representations and ‘API’ design. This package is designed to make it easy to install and load multiple ‘tidyverse’ packages in a single step. Learn more about the ‘tidyverse’ at https://www.tidyverse.org.
MFID_analogy_read <- read_csv("/Users/vince/OneDrive/Documents/GitHub/Data2SciComm/WeeklyAssignments/tidy_data/MFIndD_analogy.csv")
## Rows: 792 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): qualtrics_id, match_type_0, match_type_1, response_type
## dbl (2): trial_number, response
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
MFID_analogy_read
2a. There are 792 rows and 6 columns in the dataset. I did not use a specific code chunk for this information because the number of rows and columns are provided when I initially read in the csv file for number 1.
MFID_analogy_read %>%
distinct(qualtrics_id) %>%
count()
2b. Using the distinct() function, I was able to filter out repeated qualtrics IDs from respondents who completed multiple trials. Therefore, there are 99 individuals in the dataset, as seen in the code chunk above.
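The same head-count can be written more compactly with n_distinct(). A minimal sketch on made-up data (the tibble below is hypothetical, not the real dataset):

```r
library(dplyr)

# Hypothetical mini version of the data: three participants, repeated rows.
toy <- tibble(qualtrics_id = c("a", "a", "b", "b", "c"))

# n_distinct() collapses the distinct() + count() pipeline into one call.
n_people <- toy %>% summarize(n = n_distinct(qualtrics_id)) %>% pull(n)
n_people  # 3
```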
MFID_analogy_read %>%
group_by(qualtrics_id) %>%
summarize(n_trials = n())
2c. As seen in the code chunk above, each of the 99 participants completed 8 trials, which I found using the counting function n() inside summarize(). Therefore, every participant has the same number of trials.
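count() is shorthand for the group_by() + summarize(n = n()) pair above. A sketch on made-up data showing one way to confirm that every participant has the same number of trials:

```r
library(dplyr)

# Hypothetical data: 3 participants x 2 trials each.
toy <- tibble(qualtrics_id = rep(c("a", "b", "c"), each = 2),
              trial_number = rep(1:2, times = 3))

# count() tallies rows per participant; if all tallies are equal,
# the distinct tallies collapse to a single value.
trial_counts <- toy %>% count(qualtrics_id)
n_distinct(trial_counts$n)  # 1, so everyone has the same number of trials
```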
MFID_summary <- MFID_analogy_read %>%
group_by(qualtrics_id) %>%
summarize(relational_matches = sum(response_type == "Rel"))
MFID_summary #make it show up
MFID_summary %>%
ggplot(aes(x = relational_matches)) +
geom_histogram(binwidth = 1, fill = "gray", color = "black") +
labs(title = "Distribution of Relational Matches",
x = "Number of Relational Matches",
y = "Number of Participants")
MFID_analogy_read %>%
group_by(qualtrics_id) %>%
select(-match_type_0, -match_type_1, -response) %>% #taking additional info out
pivot_wider(names_from = trial_number,
values_from = response_type,
names_prefix = "trial ") %>%
select("trial 1", "trial 2", "trial 3", "trial 4", "trial 5", "trial 6", "trial 7", "trial 8") #reordering to make it look better
## Adding missing grouping variables: `qualtrics_id`
MFID_analogy_read %>%
group_by(qualtrics_id) %>%
pivot_wider(names_from = trial_number,
values_from = response_type) %>%
select(-match_type_0, -match_type_1, -response) %>% #remove these columns
select(1, 2, 3, 4, 5, 6, 7, 8)
Show My Work
This code chunk did work, just not in the way that I intended it to. After removing the unnecessary columns, I wanted to reorder the columns so that they would start at trial 1 and increment to trial 8. In the corrected code chunk, I added a prefix of “trial ” to the column names to be more specific. I then realized that I am still a little unfamiliar with the syntax of certain functions, which is why I forgot to put the column names in quotations.
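One way around the quoting issue is to let tidyselect build the names: with an underscore prefix, num_range() selects the trial columns in numeric order without typing each quoted name. A sketch on made-up data (the tibble is hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical long data: 2 participants x 3 trials.
toy <- tibble(qualtrics_id = rep(c("a", "b"), each = 3),
              trial_number = rep(1:3, times = 2),
              response_type = c("Rel", "Obj", "Rel", "Obj", "Obj", "Rel"))

wide <- toy %>%
  pivot_wider(names_from = trial_number,
              values_from = response_type,
              names_prefix = "trial_") %>%
  # num_range() picks trial_1..trial_3 in order, no quoting needed
  select(qualtrics_id, num_range("trial_", 1:3))
```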
MFID_REI_read <- read_csv("/Users/vince/OneDrive/Documents/GitHub/Data2SciComm/WeeklyAssignments/tidy_data/MFIndD_REI.csv")
## Rows: 4000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): qualtrics_id, sub_type, rev_scoring, response
## dbl (2): item_number, scored_response
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(MFID_REI_read)
MFID_REI_read
MFID_REI_read_fixed <- MFID_REI_read %>%
mutate(response = case_when(response == "2" ~ 2,
response == "3" ~ 3,
response == "4" ~ 4,
response == "Strongly Agree" ~ 5,
response == "Strongly Disagree" ~ 1))
MFID_REI_read_fixed
MFID_REI_read_mut <- MFID_REI_read_fixed %>%
mutate(new_scored_response = case_when(is.na(rev_scoring) ~ response,
rev_scoring == "neg" ~ 6 - response))
MFID_REI_read_mut
MFID_REI_read_mut %>%
mutate(resp_check = case_when(new_scored_response == scored_response ~ TRUE,
new_scored_response != scored_response ~ FALSE)
) %>%
distinct(resp_check)
MFID_REI_read_mut %>% distinct(sub_type)
REI_summary <- MFID_REI_read_mut %>%
group_by(qualtrics_id, sub_type) %>%
summarise(total_score = sum(new_scored_response))
## `summarise()` has grouped output by 'qualtrics_id'. You can override using the
## `.groups` argument.
REI_summary
na <- REI_summary %>%
filter(is.na(total_score))
na
1b. Yes, there are scores that are NA.
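The NAs come from sum()’s default behavior: if any value in a group is NA, the sum is NA unless na.rm = TRUE drops the missing values first. A base-R illustration:

```r
# sum() propagates NA by default; na.rm = TRUE drops missing values first.
x <- c(2, NA, 3)
sum(x)               # NA
sum(x, na.rm = TRUE) # 5
```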
REI_sum_notna <- MFID_REI_read_mut %>%
group_by(qualtrics_id, sub_type) %>%
summarise(total_score = sum(new_scored_response, na.rm = TRUE))
## `summarise()` has grouped output by 'qualtrics_id'. You can override using the
## `.groups` argument.
REI_sum_notna
na2 <- REI_sum_notna %>%
filter(is.na(total_score))
na2
MFID_combined <- REI_sum_notna %>%
inner_join(MFID_summary)
## Joining with `by = join_by(qualtrics_id)`
MFID_combined
ggplot(MFID_combined, aes(x = total_score,
y = relational_matches,
color = sub_type)) +
geom_point() +
geom_smooth() +
labs(title = "Relation Between REI Score and Analogy Score",
x = "REI Total Score",
y = "Analogy Score") +
theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
MFID_prob <- read_csv("/Users/vince/OneDrive/Documents/GitHub/Data2SciComm/WeeklyAssignments/tidy_data/MFIndD_probtask.csv")
## Rows: 26136 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): SubID, condition, discreteness, regularity, integration, major_col...
## dbl (6): rt_toolong, rt_tooshort, rt_exclu, trial_index, block, rt
## lgl (1): correct
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
MFID_prob
2a. create a vector
conditions <- MFID_prob %>%
distinct(condition) %>%
.$condition
conditions # vector to store conditions
## [1] "dots_EqSizeSep" "dots_EqSizeRand" "blob_shifted" "blob_stacked"
2b. calculate mean
mean_rt <- numeric(length(conditions))
#filter invalid trials
valid_trials <- MFID_prob %>%
filter(rt_exclu == 0)
for (i in seq_along(conditions)) {
condition_data <- valid_trials %>%
filter(condition == conditions[i])
mean_rt[i] <- mean(condition_data$rt, na.rm = TRUE) #put the calculated mean in the mean_rt vector
}
mean_rt
## [1] 889.8202 879.6184 915.1974 866.0546
Logical explanation:
First, create a vector to store the mean reaction times. To find the mean of each condition, I first need to keep only the valid trials (where rt_exclu == 0). Then I can loop over each condition and filter the valid trials down to the rows matching that condition from our conditions vector. Finally, I calculate the mean of all the reaction times for that specific condition and store it in the mean reaction time vector.
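The loop works, but the same per-condition means fall out of a single group_by() + summarize() call. A sketch on made-up data (the tibble is hypothetical, standing in for valid_trials):

```r
library(dplyr)

# Hypothetical stand-in for valid_trials: two conditions with known RTs.
toy <- tibble(condition = c("A", "A", "B", "B"),
              rt = c(100, 200, 300, 500))

# group_by() + summarize() computes every condition mean at once,
# replacing the explicit loop over the conditions vector.
mean_rt_tbl <- toy %>%
  group_by(condition) %>%
  summarize(mean_rt = mean(rt, na.rm = TRUE))
mean_rt_tbl$mean_rt  # 150 400
```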
summary_across <- valid_trials %>%
group_by(condition) %>%
summarize(across(c(rt, correct), ~ mean(.x, na.rm = TRUE),
.names = "mean_{.col}")) # mean rt and accuracy in one across() call
summary_across
summary_without <- valid_trials %>%
group_by(condition) %>%
summarize(mean_rt = mean(rt, na.rm = TRUE), # find the mean
overall_accuracy = mean(correct, na.rm = TRUE)) #find the accuracy
summary_without
MFID_prob
MFID_prob %>%
group_by(condition) %>%
summarise(across(c(rt, correct), mean)) %>%
pivot_longer(c(rt, correct)) %>%
ggplot(aes(y = value, x = condition)) +
geom_point(color = "red") +
facet_wrap(~name, scales = "free")
1a. From the plot, I assume that a higher proportion of correct responses corresponds to a faster reaction time, or vice versa. Blob_stacked and dots_EqSizeRand have higher accuracy and faster reaction times, while blob_shifted and dots_EqSizeSep are lower in accuracy with slower reaction times.
1b. The first thing that I noticed was the values on the axes of the graphs. Graphs with great variability typically have large ranges that support conclusions about relationships. For example, I expected to see something along the lines of higher accuracy resulting in a slower reaction time. More specifically, the blob_stacked column stood out because it had the highest accuracy but the fastest reaction time.
1c. What I find difficult to see is a clear relationship between the accuracy of the conditions and the mean reaction time. The conclusion I came to in 1a is a pure assumption from looking at the plot, and I feel there are ways to visualize the information better. More descriptive labeling of the graph and its axes would also help: there is no title to describe what data we are looking at, and the y-axis in particular lacks information about what the values in each panel represent.
2a. summarize the data set
sum_prob <- MFID_prob %>%
group_by(SubID, condition) %>%
summarize(prop_corr = mean(correct))
## `summarise()` has grouped output by 'SubID'. You can override using the
## `.groups` argument.
sum_prob
2b. make the plot
sum_prob %>%
group_by(condition) %>%
summarize(mean_prop_corr = mean(prop_corr, na.rm = TRUE)) %>%
ggplot(aes(y = mean_prop_corr,
x = condition)) +
stat_summary(fun = mean, geom = "point", color = "red") +
theme_minimal() +
labs(x = "Condition",
y = "Mean Proportion Correct",
title = "Mean Accuracy of Responses by Condition on Probability Task") +
coord_cartesian(ylim = c(0, 1))
2c. To make the graph less misleading, I added a descriptive title and y-axis label so the reader knows what graph they are looking at and what information is displayed on it. I also changed the scale so that it better shows that the y-axis is a proportion.
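An equivalent way to fix the axis (a sketch with hypothetical data): scale_y_continuous() can set both the limits and the break points, which makes the 0-to-1 proportion scale explicit.

```r
library(ggplot2)

# Hypothetical two-condition accuracies for illustration.
toy <- data.frame(condition = c("A", "B"), mean_prop_corr = c(0.7, 0.9))

p <- ggplot(toy, aes(x = condition, y = mean_prop_corr)) +
  geom_point(color = "red") +
  # limits fix the visible range; breaks label it in proportion steps
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.25)) +
  theme_minimal()
```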
3a. add distributional info
sum_prob %>%
ggplot(aes(y = prop_corr,
x = condition)) +
stat_summary(fun = mean, geom = "point", color = "red", size = 3) +
geom_dotsinterval(dotsize = 0.5,
slab_color = "lightblue",
slab_alpha = 0.6) +
theme_minimal() +
labs(x = "Condition",
y = "Mean Proportion Correct",
title = "Mean Accuracy of Responses by Condition on Probability Task") +
ylim(0,1)
3b. What does this graph tell us: From my understanding, the dots represent individual cases of proportion correct for the different SubIDs. Thus, it allows us to see the variability across the dataset and where responses were more concentrated. Overall, it gives us more information about why the mean proportion correct for each condition is the way it is.
Aesthetics: As seen above, I changed the scale of the y-axis, added a descriptive title and y-axis title, and set the theme to minimal. I made the actual mean points larger so that they would stand out in comparison to the dots. For the geom_dotsinterval layer, I changed the color to light blue, reduced the dot size, and lowered their opacity so they didn’t overtake the importance of the mean.
wrangled <- MFID_prob %>%
group_by(SubID, condition) %>%
summarize(rt = mean(rt, na.rm = TRUE),
correct = mean(correct, na.rm = TRUE))
## `summarise()` has grouped output by 'SubID'. You can override using the
## `.groups` argument.
wrangled
3a. colors
wrangled %>%
ggplot(aes(x = rt,
y = correct,
color = condition)) +
geom_point(size = 3,
alpha = 0.6) +
theme_minimal() +
labs(title = "Relationship Between Reaction Time and Accuracy Separated by Condition",
x = "Average Reaction Time",
y = "Proportion Correct",
color = "Condition")
3b. facets in separate plots
wrangled %>%
ggplot(aes(x = rt,
y = correct)) +
geom_point(size = 3,
alpha = 0.6) +
theme_minimal() +
labs(title = "Relationship Between Reaction Time and Accuracy Separated by Condition",
x = "Average Reaction Time",
y = "Proportion Correct",
color = "Condition") +
facet_wrap(~condition, scales = "free")